Abstract 

We consider the following sample selection problem. We observe in an online fashion a sequence of 
samples, each endowed with a quality. Our goal is to either select or reject each sample, so as to maximize 
the aggregate quality of the subsample selected so far. There is a natural trade-off here between the rate 
of selection and the aggregate quality of the subsample. We show that for a number of such problems 
extremely simple and oblivious "threshold rules" for selection achieve optimal tradeoffs between rate of 
selection and aggregate quality in a probabilistic sense. In some cases we show that the same threshold 
rule is optimal for a large class of quality distributions and is thus oblivious in a strong sense. 
1 Introduction 



Imagine a heterogeneous sequence of samples from an array of sensors, having different utilities reflecting 
their accuracy, quality, or applicability to the task at hand. We wish to discard all but the most relevant or 
useful samples. Further suppose that selection is performed online — every time we receive a new sample 
we must make an irrevocable decision to keep it or discard it. What rules can we use for sample selection? 
There is a tradeoff here: while we want to retain only the most useful samples, we may not want to be overly 
selective and discard a large fraction. So we could either fix a rate of selection (the number of examples we 
want to retain as a function of the number we see) and ask for the best quality subsample, or fix a desirable 
level of quality as a function of the size of the subsample and ask to achieve this with the fewest samples 
rejected. 

An example of online sample selection is the following "hiring" process that has been studied previously. 
Imagine that a company wishing to grow interviews candidates to observe their qualifications, work ethic, 
compatibility with the existing workforce, etc. How should the company make hiring decisions so as to 
obtain the highest quality workforce possible? As for the sensor problem, there is no single correct answer 
here. Rather, a good hiring strategy depends on the rate at which the company plans to grow; again there 
is a trade-off between being overly selective and growing fast. Broder et al. [7] studied this hiring problem 
in a simple setting where each candidate's quality is a one-dimensional random variable and the company 
wants to maximize the average or median quality of its workforce. 

In general performing such selection tasks may require complicated rules that depend on the samples 
seen so far. Our main contribution is to show that in a number of settings an extremely simple class of rules 
that we call "threshold rules" is close to optimal on average (within constant factors). 

Specifically, suppose that each sample is endowed with a "quality", which is a random variable drawn 
from a known distribution. We are interested in maximizing the aggregate quality of a set of samples, which 
is a numerical function of the individual qualities. Suppose that we want to select a subset of n samples 
out of a total of T seen. Let $Q^*_{T,n}$ denote the maximum aggregate quality that can be achieved by picking 
the best n out of the T samples. Our goal is to design an online selection rule that approximates $Q^*_{T,n}$ in 
expectation over the T samples. We use two measures of approximation: the ratio of the expected quality 
achieved by the offline optimum to that achieved by the online selection rule, $E[Q^*_{T,n}]/E[Q_{T,n}]$, and the 
expectation of the ratio of the qualities of the two rules, $E[Q^*_{T,n}/Q_{T,n}]$. Here the expectations are taken 
over the distribution from which the sample is drawn. The approximation ratios are always at least 1 and 
our goal is to show that they are bounded from above by a constant independent of n. In this case we say 
that the corresponding selection rule is optimal. 

To put this in context, consider the setting studied by Broder et al. [7]. Each sample is associated with a 
quality in the range [0, 1], and the goal is to maximize the average quality of the subsample we pick. Broder 
et al. show (implicitly) that if the quality is distributed uniformly in [0, 1] a natural select above the mean 
rule is optimal to within constant factors with respect to the optimal offline algorithm that has the same 
selection rate as the rule. The same observation holds also for the select above the median rule. Both of 
these rules are adaptive in the sense that the next selection decision depends on the samples seen so far. In 
more general settings, adaptive rules of this kind can require unbounded space to store information about 
samples seen previously. For example, consider the following 2-dimensional skyline problem: each sample 
is a point in a unit square; the quality of a single point (x, y) is the area of its "shadow" [0, x] x [0, y], and the 
quality of a set of points is the area of the collective shadows of all the points; the goal is to pick a subsample 
with the largest shadow. In this case, a natural selection rule is to select a sample if it falls out of the shadow 
of the previously seen points. However, implementing this rule requires remembering on average O(log n) 
samples out of n samples seen. We therefore study non-adaptive selection rules. 

We focus in particular on so-called "threshold rules" for selection. A threshold rule specifies a criterion 
or "threshold" that a candidate must satisfy to get selected. Most crucially, the threshold is determined a 
priori given a desired selection rate; it depends only on the number of samples picked so far and is otherwise 
independent of the samples seen or picked. Threshold rules are extremely simple oblivious rules and can, in 
particular, be "hard-wired" into the selection process. This suggests the following natural questions. When 
are threshold rules optimal for online selection problems? Does the answer depend on the desired rate of 
selection? We answer these questions in three different settings in this paper. 

The first setting we study is a single-dimensional-quality setting similar to Broder et al.'s model. In this 
setting, we study threshold rules of the form "Pick the next sample whose quality exceeds f(i)" where i is 
the number of samples picked so far. We show that for a large class of functions f these rules give constant 
factor approximations. Interestingly, our threshold rules are optimal in an almost distribution-independent 
way. In particular, every rule f in the aforementioned class is simultaneously constant-factor optimal with 
respect to any "power law" distribution, and the approximation factor is independent of the parameters of 
the distribution. In contrast, Broder et al.'s results hold only for the uniform distribution. 

In the second setting, samples are nodes in a rooted infinite-depth tree. Each node is said to cover all the 
nodes on the unique path from the root to itself. The quality of a collection of nodes is the total number of 
distinct nodes that they collectively cover. This is different from the first setting in that the quality defines 
only a partial order over the samples. Once again, we study threshold rules of the form "Pick the next sample 
whose quality exceeds f(i) " and show that they are constant factor optimal. 

Our third setting is a generalization of the skyline problem described previously. Specifically, consider 
a domain X with a probability measure $\mu$ and a partial ordering $\prec$ over it. For an element $x \in X$, the 
"shadow" or "downward closure" of x is the set of all the points that it dominates in this partial ordering, 
$\mathcal{D}(x) = \{y : y \prec x\}$; likewise the shadow of a subset $S \subseteq X$ is $\mathcal{D}(S) = \bigcup_{x \in S} \mathcal{D}(x)$. Once again, as in 
the second setting, we can define the coverage of a single sample to be the measure of all the points in its 
shadow. However, unlike the tree setting, here it is usually easy to obtain a constant factor approximation to 
coverage: the maximum coverage achievable is 1 (i.e. the measure of the entire universe), whereas in many 
cases (e.g. for the uniform distribution over the unit square) a single random sample can in expectation 
obtain constant coverage. We therefore measure the quality of a subsample $S \subseteq X$ by its "gap", 
$\mathrm{Gap}(S) = 1 - \mu(\mathcal{D}(S))$. In this setting, rules that place a threshold on the quality of the next sample to be selected 
are not constant-factor optimal. Instead, we study threshold rules of the form "Pick the next sample x for 
which $\mu(\mathcal{U}(x))$ is at most f(i)", where $\mathcal{U}(x) = \{y : x \prec y\}$ is the set of all elements that dominate x, or 
the "upward closure" of x, and show that these rules obtain constant factor approximations. 

1.1 Related work 

As mentioned earlier, our work is inspired by and extends the work of Broder et al. [7]. Broder et al. consider 
a special case of the one-dimensional selection problem described above. They assume that the quality of 
a sample is distributed uniformly over the interval (0, 1); this assumption is not without loss of generality. 
(While Broder et al.'s result can be extended to an arbitrary distribution via a standard transformation from 
one space to another, the resulting selection rule becomes distribution dependent; e.g., "select above the mean" 
is no longer "select above the mean" with respect to the other distribution after applying the transformation.) 
They analyze two adaptive selection rules, select above the mean and select above the median, and show 
that both are constant-factor optimal, although they lead to different growth rates. These rules are adaptive 
in the sense that the next selection decision depends on the quality of the samples accepted so far. Note that 
the select above the median rule requires the algorithm to remember all of the samples accepted so far, and 
is therefore a computationally intensive rule. Even the relatively simpler select above the mean rule requires 
remembering the current mean and the number accepted so far. In contrast we show (Section 3) that there 
exists a class of simple non-adaptive selection strategies that also achieves optimality and includes rules 
with selection rates equal to those of the ones studied by Broder et al. These strategies make decisions based 
only on the number hired so far. Furthermore we extend these results to more general coverage problems. 

Our third setting is closely related to the skyline problem that has been studied extensively in online 
settings by the database community (see, for example, [2] and references therein). Kung et al. [12] gave an 
offline divide-and-conquer algorithm that finds the skyline of a given set of vectors in d-dimensional space. 
Their algorithm uses $O(n \log_2 n)$ comparisons of vector components when d = 2, 3 and $O(n (\log_2 n)^{d-2})$ 
when $d \ge 4$. The implementation of a Skyline query for database systems was recently introduced by [6]. 
The closest in spirit to our work is [15]. They considered a stream of uncertain objects to model uncertainty 
in measurement. Each object has an associated set of possible instances and they are interested in the objects 
whose probability of being dominated by another object is at most some q supplied by the database user. 

Online sample selection is closely related to secretary problems, however there are some key differences. 
In secretary problems (see, e.g., HHHEl) there is typically a fixed bound on the desired number of hires. In 
our setting the selection process is ongoing and we must pick more and more samples as time passes. This 
makes the tradeoff between the rate of hiring and the rate of improvement of quality interesting. 

Finally, while our goal is to analyze a class of online algorithms in comparison to the optimal offline 
algorithms, our approach is different from the competitive analysis of online algorithms Q. In competitive 
analysis the goal is to perform nearly as well as the optimal offline algorithm for any arbitrary sequence of 
input. In contrast, we bound the expected competitive ratio of the rules we study. Furthermore, a crucial 
aspect of the strategies that we study is that not only are they online, but they are also non-adaptive or 
oblivious. That is, the current acceptance threshold does not depend on the samples seen by the algorithm 
so far. In this sense, our model is closer in spirit to work on oblivious algorithms (see, e.g., [11], [3], [10]). 
Oblivious algorithms are highly desirable in practical settings because the rules can be hard-wired into the 
selection process, making them very easy to implement. The caveat is, of course, that for many optimization 
problems oblivious algorithms do not provide good approximations. Surprisingly, we show that in many 
scenarios related to sample selection, obliviousness has only a small cost. 

2 Models and results 

Let X be a domain with probability measure $\mu$ over it. A threshold rule $\mathcal{X}$ is specified by a sequence of 
subsets of X indexed by $\mathbb{N}$: $X = \mathcal{X}_0 \supseteq \mathcal{X}_1 \supseteq \mathcal{X}_2 \supseteq \cdots \supseteq \mathcal{X}_n \supseteq \cdots$. A sample is selected if it belongs to 
$\mathcal{X}_i$, where i is the number of samples previously selected. 
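The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_threshold_rule` and `accepts_interval` are hypothetical, and the example rule uses $c_i = (i+1)^{-1/2}$ so that the zero-based count i never causes a division by zero.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")


def run_threshold_rule(
    samples: Iterable[S],
    accepts: Callable[[int, S], bool],
    n: int,
) -> Tuple[List[S], int]:
    """Run an oblivious threshold rule until n samples are selected.

    accepts(i, x) must say whether x lies in the i-th set of the rule, where
    i is the number of samples selected so far. Returns the selected samples
    and the number of samples seen (the waiting time).
    """
    if n <= 0:
        return [], 0
    selected: List[S] = []
    seen = 0
    for x in samples:
        seen += 1
        # The decision depends only on x and the count selected so far.
        if accepts(len(selected), x):
            selected.append(x)
            if len(selected) == n:
                break
    return selected, seen


def accepts_interval(i: int, x: float) -> bool:
    """Hypothetical rule on [0, 1]: accept x when x > 1 - c_i, c_i = (i+1)^(-1/2)."""
    return x > 1.0 - (i + 1) ** -0.5


stream = [0.1, 0.9, 0.2, 0.95, 0.5, 0.99]
picked, waited = run_threshold_rule(stream, accepts_interval, 3)
print(picked, waited)
```

Because the rule never inspects previously seen samples, the only state it carries is the counter `len(selected)`, which is what makes such rules easy to hard-wire.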

Let $\mathcal{T}$ be an infinite sequence of samples drawn i.i.d. according to $\mu$. Let $\mathcal{T}^{\mathcal{X}}(n)$ denote the prefix of $\mathcal{T}$ 
such that the last sample on this prefix is the nth sample chosen by the threshold rule $\mathcal{X}$; let $T^{\mathcal{X}}_n$ denote the 
length of this prefix. We drop the superscript and the subscript when they are clear from the context. The 
"selection overhead" of a threshold rule as a function of n is the expected waiting time to select n samples, 
or $E[T^{\mathcal{X}}_n]$, where the expectation is over $\mathcal{T}$. 

Let Q be a function denoting "quality". Thus Q(x) denotes the quality of a sample x and Q(S) the 
aggregate quality of a set $S \subseteq X$ of samples. Q(x) is a random variable and we assume that it is drawn 
from a known distribution. Let $Q^*_{T,n}$ denote the quality of an optimal subset of n out of a set T of samples 
with respect to measure Q. We use $Q^*_n$ as shorthand for $Q^*_{\mathcal{T}^{\mathcal{X}}(n),n}$ where $\mathcal{X}$ is clear from the context. Let 
$Q^{\mathcal{X}}_n$ ($Q_n$ for short) denote the quality of a sample of size n selected by threshold rule $\mathcal{X}$ with respect 
to measure Q. 
We look at both maximization and minimization problems. For maximization problems we say that a 
threshold rule $\mathcal{X}$ achieves a competitive ratio of $\alpha$ in expectation with respect to Q if for all n, 

$$E_{\mathcal{T} \sim \mu}\left[\frac{Q^*_n}{Q_n}\right] \le \alpha.$$

Likewise, $\mathcal{X}$ $\alpha$-approximates expected quality with respect to Q if for all n, 

$$\frac{E_{\mathcal{T} \sim \mu}[Q^*_n]}{E_{\mathcal{T} \sim \mu}[Q_n]} \le \alpha.$$

For minimization problems, the ratios are defined similarly: 

$$\text{Exp. comp. ratio} = \max_n E_{\mathcal{T} \sim \mu}\left[\frac{Q_n}{Q^*_n}\right]; \qquad \text{Approx. to exp. quality} = \max_n \frac{E_{\mathcal{T} \sim \mu}[Q_n]}{E_{\mathcal{T} \sim \mu}[Q^*_n]}.$$
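The two measures are not the same quantity, which a small numerical sketch makes concrete. The function names and the trial data below are hypothetical; each trial is a pair (offline optimum, online quality) for a maximization problem at a fixed n.

```python
from statistics import mean
from typing import List, Tuple

Trial = Tuple[float, float]  # (Q*_n, Q_n) observed in one random run


def expected_competitive_ratio(trials: List[Trial]) -> float:
    """Empirical estimate of E[Q*_n / Q_n]: average the per-run ratios."""
    return mean([opt / online for opt, online in trials])


def approx_to_expected_quality(trials: List[Trial]) -> float:
    """Empirical estimate of E[Q*_n] / E[Q_n]: ratio of the two averages."""
    return mean([opt for opt, _ in trials]) / mean([online for _, online in trials])


# Two made-up runs: the rule is far from optimal in the first, optimal in the second.
trials = [(2.0, 1.0), (4.0, 4.0)]
print(expected_competitive_ratio(trials))   # average of 2.0 and 1.0
print(approx_to_expected_quality(trials))   # 3.0 divided by 2.5
```

The first measure weights every run equally, while the second is dominated by runs with large absolute quality, so a rule can look better under one than under the other.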



We now describe the specific settings we study and the results we obtain. 



Model 1: Unit interval (Section 3). Our first setting is the one-dimensional setting studied by Broder et 
al. [7]. Specifically, each sample is associated with a quality drawn from a distribution over the unit line. Our 
measure of success is the mean quality of the subsample we select. Note that in the context of approximately 
optimal selection rules this is a weak notion of success. For example when $\mu$ is the uniform distribution over 
[0, 1], even in the absence of any selection rule we can achieve a mean quality of 1/2, while the maximum 
achievable is 1. So instead of approximately maximizing the mean quality, we approximately minimize the 
mean quality gap, $\mathrm{Gap}(S) = 1 - (\sum_{x \in S} x)/|S|$, of the subsample. 

We focus on power-law distributions on the unit line, i.e. distributions with c.d.f. $\mu(1 - x) = 1 - x^k$ for 
some constant k, and study threshold rules of the form $\mathcal{X}_i = \{x : x > 1 - c_i\}$ where $c_i = \Omega(1/\mathrm{poly}(i))$. We 
show that these threshold rules are constant factor optimal simultaneously for any power-law distribution. 
Remarkably, this gives an optimal selection algorithm that is oblivious of even the underlying distribution. 
Formally we obtain the following result. 

Theorem 1 For the unit line equipped with a power-law distribution, any threshold rule $\mathcal{X}_i = \{x : x > 1 - c_i\}$, 
where $c_i = 1/i^{\alpha}$ with $0 < \alpha < 1$ for all i, achieves an O(1) approximation to the expected gap, 
where the constant in the O(1) depends only on $\alpha$ and not on the parameters of the distribution. 
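A small simulation can illustrate the statement of Theorem 1. It is a sketch under assumptions that are not in the original text: the helper names are hypothetical, the thresholds are taken as $c_i = (i+1)^{-\alpha}$ for the zero-based count i, and qualities are drawn by inverse-transform sampling so that the complement $1 - Y$ has c.d.f. $x^k$.

```python
import random
from typing import Tuple


def draw_quality(k: float, rng: random.Random) -> float:
    """Draw Y whose complement 1 - Y has c.d.f. x**k on [0, 1]."""
    return 1.0 - rng.random() ** (1.0 / k)


def simulate_mean_gap(n: int, alpha: float, k: float, seed: int = 0) -> Tuple[float, float, int]:
    """Run the threshold rule until n samples are selected.

    Returns (online mean gap, offline mean gap of the best n samples seen,
    number of samples seen).
    """
    rng = random.Random(seed)
    seen = []
    selected = []
    while len(selected) < n:
        y = draw_quality(k, rng)
        seen.append(y)
        c = (len(selected) + 1) ** (-alpha)  # assumed threshold sequence
        if y > 1.0 - c:
            selected.append(y)
    online_gap = 1.0 - sum(selected) / n
    best = sorted(seen, reverse=True)[:n]    # offline optimum on the same prefix
    offline_gap = 1.0 - sum(best) / n
    return online_gap, offline_gap, len(seen)


online, offline, waited = simulate_mean_gap(n=50, alpha=0.5, k=2.0, seed=1)
print(online, offline, waited)
```

The offline gap can never exceed the online gap on the same prefix; the theorem's content is that, in expectation, the online gap is larger by at most a constant factor.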



Dominance and shadow. For the next two settings, we need some additional definitions. Let $\prec$ be a partial 
order over the universe X. As defined earlier, the shadow of an element $x \in X$ is the set of all the points 
that it dominates, $\mathcal{D}(x) = \{y : y \prec x\}$; likewise the shadow of a sample $S \subseteq X$ is $\mathcal{D}(S) = \bigcup_{x \in S} \mathcal{D}(x)$. 
Let $\mathcal{U}(x) = \{y : x \prec y\}$ be the set of points that shadow x; $\mathcal{U}(S)$ for a set S is defined similarly. Note that 
$\mathcal{U}(x)$ is a subset of $X \setminus \mathcal{D}(x)$ and $\mu(\mathcal{U}(x))$ is the probability that a random sample covers x. 

Model 2: Random tree setting (Section 4). While the previous setting was in a continuous domain, next 
we consider a discrete setting, where the goal is to maximize the cardinality of the shadow set. Specifically, 
our universe X is the set of all nodes in a rooted infinite-depth binary tree. The following random process 
generates samples. Let 0 < p < 1. We start at the root and move left or right at every step with equal 
probability. At every step, with probability p we terminate the process and output the current node. A node 
x in the tree dominates another node y if and only if y lies on the unique path from the root to x. For a set 
S of nodes, we define coverage as $\mathrm{Cover}(S) = |\mathcal{D}(S)|$. Note that, unlike in the previous setting, there is no 
notion of a gap in this setting. 
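The sampling process and the coverage function can be sketched as follows. The encoding is an assumption made for illustration: a node is represented by the string of 'L'/'R' moves from the root (the root is the empty string), the process is taken to check for termination before each move, and a node's shadow is taken to include the root and the node itself.

```python
import random
from typing import Iterable, Set


def sample_node(p: float, rng: random.Random) -> str:
    """Walk down from the root; before each move, stop with probability p."""
    path = []
    while rng.random() >= p:              # continue with probability 1 - p
        path.append(rng.choice("LR"))     # left or right with equal probability
    return "".join(path)


def shadow(node: str) -> Set[str]:
    """All nodes on the path from the root to `node`, inclusive."""
    return {node[:d] for d in range(len(node) + 1)}


def cover(nodes: Iterable[str]) -> int:
    """Cover(S): number of distinct nodes covered by the set S."""
    covered: Set[str] = set()
    for x in nodes:
        covered |= shadow(x)
    return len(covered)


print(cover(["LR", "LL"]))  # root, L, LR, LL
```

Two samples that share a long prefix add little coverage, which is why the quality of a set is not the sum of the qualities of its members.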

Once again the threshold rules we consider here are based on sequences of integers $\{c_i\}$. For any such 
sequence, we define $\mathcal{X}_i = \{x : |\mathcal{D}(x)| > c_i\}$. We show that constant-factor optimality can be achieved with 
exponential or smaller selection overheads. 

Theorem 2 For the binary tree model described above, any threshold rule based on a sequence $\{c_i\}$ with 
$c_i = O(\mathrm{poly}(i))$ achieves an O(1) competitive ratio in expectation with respect to coverage, as well as an 
O(1) approximation to the expected coverage. 
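To see what such a rule does in practice, the following sketch runs a depth threshold on simulated tree samples. Everything named here is hypothetical: nodes are encoded as strings of 'L'/'R' moves, $|\mathcal{D}(x)|$ is taken to be the depth plus one, and the threshold $c_i = \log_2(i+1) + 2$ is just one example of a polynomially bounded sequence.

```python
import math
import random
from typing import Callable, Tuple


def sample_node(p: float, rng: random.Random) -> str:
    """Random root-to-node walk that stops with probability p before each move."""
    path = []
    while rng.random() >= p:
        path.append(rng.choice("LR"))
    return "".join(path)


def run_tree_rule(n: int, p: float, c: Callable[[int], float], seed: int = 0) -> Tuple[int, int]:
    """Select n nodes with the rule |D(x)| > c(i); return (coverage, samples seen)."""
    rng = random.Random(seed)
    selected = []
    seen = 0
    while len(selected) < n:
        x = sample_node(p, rng)
        seen += 1
        if len(x) + 1 > c(len(selected)):   # |D(x)| = depth + 1 (assumed convention)
            selected.append(x)
    covered = set()
    for x in selected:
        covered.update(x[:d] for d in range(len(x) + 1))
    return len(covered), seen


def threshold(i: int) -> float:
    """Hypothetical slowly growing threshold sequence."""
    return math.log2(i + 1) + 2


coverage, seen = run_tree_rule(n=32, p=0.1, c=threshold, seed=3)
print(coverage, seen)
```

Raising the threshold faster increases the coverage per selected node but also the waiting time, which is the trade-off the theorem quantifies.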

Model 3: Skyline problem (Section 5). Finally, we consider another continuous domain that is a 
generalization of the skyline problem mentioned previously. We are interested in selecting a set of samples with a 
large shadow. Specifically, we define the "gap" of S to be $\mathrm{Gap}(S) = 1 - \mu(\mathcal{D}(S))$. Our goal is to minimize 
the gap. 
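For the two-dimensional unit square with the uniform measure, the gap of a finite set can be computed exactly with a sweep over the points. The sketch below assumes that setting; the function names are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def shadow_area(points: Iterable[Point]) -> float:
    """Area of the union of the rectangles [0, x] x [0, y] over all points.

    Sweep from the largest x to the smallest: on the strip between two
    consecutive x-values, the union's height is the largest y seen so far.
    """
    pts = sorted(points, key=lambda p: p[0], reverse=True)
    area = 0.0
    best_y = 0.0
    for j, (x, y) in enumerate(pts):
        best_y = max(best_y, y)
        next_x = pts[j + 1][0] if j + 1 < len(pts) else 0.0
        area += (x - next_x) * best_y
    return area


def gap(points: Iterable[Point]) -> float:
    """Gap(S) = 1 - (uniform measure of the shadow of S)."""
    return 1.0 - shadow_area(list(points))


print(gap([(0.5, 0.5)]))                # one rectangle of area 0.25
print(gap([(1.0, 0.5), (0.5, 1.0)]))    # two overlapping rectangles
```

A dominated point, such as (0.25, 0.25) next to (0.5, 0.5), leaves the gap unchanged, which is why only undominated points matter for the offline optimum.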

We show that a natural class of threshold rules obtains near-optimal gaps in this setting. Recall that $\mathcal{U}(x)$ 
for an element $x \in X$ denotes the set of elements that dominate x. We consider threshold rules of the form 
$\mathcal{X}_i = \{x \in X : \mu(\mathcal{U}(x)) \le c_i\}$ for some sequence of numbers $\{c_i\}$. We require the following continuity 
assumption on the measure $\mu$. 

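Definition 1 (Measure continuity) For all $x \in X$ and $c \in [0, \mu(\mathcal{U}(x))]$, there exists an element $y \in \mathcal{U}(x)$ 
such that $\mu(\mathcal{U}(y)) = c$. Furthermore, there exist elements $\underline{x}, \overline{x} \in X$ with $\mathcal{U}(\underline{x}) = X$ and $\mathcal{U}(\overline{x}) = \emptyset$. 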

Measure continuity ensures that the sets Xi are all non-empty and proper subsets of each other. 

Theorem 3 For the skyline setting with an arbitrary measure satisfying measure continuity, any threshold 
rule based on a sequence $\{c_i\}$ with $c_i = i^{-(1/2 - \Omega(1))}$ achieves a 1 + o(1) competitive ratio in expectation 
with respect to the gap. 

We note that the class of functions $c_i$ specified in the above theorem includes all functions for which 
$1/c_i$ grows subpolynomially. In particular, this includes threshold rules with selection overheads that are 
slightly superlinear. 

For the special case of the skyline setting over a two-dimensional unit square $[0, 1]^2$ bestowed with 
a product distribution and the usual precedence ordering, $(x_1, y_1) \prec (x_2, y_2)$ if and only if $x_1 \le x_2$ 
and $y_1 \le y_2$, we are able to obtain a stronger result that guarantees constant-factor optimality for any 
polynomial selection overhead: 

Theorem 4 For the skyline setting on the unit square with any product distribution, any threshold rule based 
on a sequence $\{c_i\}$ with $c_i = \Omega(1/\mathrm{poly}(i))$ achieves a 1 + o(1) competitive ratio in expectation with respect 
to the gap. 

3 Sample selection in one dimension 

We will now prove Theorem 1. For a (random) variable $x \in [0, 1]$, let $\bar{x}$ denote its complement $1 - x$. 
For a cumulative distribution $\mu$ with domain [0, 1], we use $\bar{\mu}$ to denote the cumulative distribution for the 
complementary random variable: $\bar{\mu}(x) = 1 - \mu(1 - x)$. 

Let Y denote a draw from the power-law distribution $\mu$, and $Y_n$ denote the (random) quality of the nth 
sample selected by $\mathcal{X}$. Note that since $\mu$ is a power-law distribution, the complement $\bar{Y}_i$ of a sample 
selected with threshold $c_i$ is statistically identical to $c_i \bar{Y}$. 
Then, the mean quality gap of the first n selected samples is given by $\mathrm{Gap}_n = \frac{1}{n} \sum_{i=1}^{n} \bar{Y}_i = \frac{1}{n} \sum_{i=0}^{n-1} c_i \bar{Y}$, 
and, by linearity of expectation we have 

$$E[\mathrm{Gap}_n] = \frac{E[\bar{Y}]}{n} \sum_{i=0}^{n-1} c_i. \qquad (1)$$

On the other hand, the following lemma gives the optimal mean quality achievable when we pick a 
subsample of size n from a sample of size $E[T_n]$. See Appendix A for the proof. 

Lemma 5 The expected mean gap of the largest n out of $E[T_n]$ samples drawn from a distribution with 
$\bar{\mu}(x) = x^k$ is 

$$\frac{1}{1 + 1/k} \left( \frac{n}{E[T_n] + 1} \right)^{1/k}. \qquad (2)$$

First, we bound (1) in terms of (2) by noting that the expected selection overhead of $\mathcal{X}$ is given by 

$$E[T_n] = \sum_{i=1}^{n} \frac{1}{\bar{\mu}(c_i)} = \sum_{i=1}^{n} \frac{1}{c_i^k}.$$
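As a worked illustration of the overhead: with $c_i = 1/i^{\alpha}$ and $\bar{\mu}(x) = x^k$, a sample passes the ith threshold with probability $c_i^k$, so each wait is geometric with mean $i^{\alpha k}$ and the total overhead is $\sum_{i=1}^{n} i^{\alpha k}$. The sketch below (function names hypothetical) evaluates this sum and compares it with the integral estimate $n^{\alpha k + 1}/(\alpha k + 1)$, which is an approximation introduced here for intuition rather than a formula from the text.

```python
def expected_overhead(n: int, alpha: float, k: float) -> float:
    """Sum of the mean geometric waiting times i**(alpha*k) for i = 1..n."""
    return sum(i ** (alpha * k) for i in range(1, n + 1))


def integral_estimate(n: int, alpha: float, k: float) -> float:
    """Leading-order estimate of the same sum via the integral of x**(alpha*k)."""
    return n ** (alpha * k + 1) / (alpha * k + 1)


# With alpha = 1/2 and k = 2 the waits are 1, 2, 3, ..., so the overhead is quadratic in n.
print(expected_overhead(4, 0.5, 2))
print(expected_overhead(1000, 0.5, 2) / integral_estimate(1000, 0.5, 2))
```

The overhead is polynomial in n for every fixed $\alpha$ and k, which is the regime in which Theorem 1 applies.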

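Lemma 6 For selection thresholds $c_i = 1/i^{\alpha}$ with $0 < \alpha < 1$, we have 

$$E[\mathrm{Gap}_n] \le \frac{1}{1 - \alpha} \left( \frac{n}{E[T_n]} \right)^{1/k}.$$

Proof: The proof follows from the Euler-Maclaurin formula and the fact that $E[\bar{Y}] \le 1$. 

$$E[\mathrm{Gap}_n] \cdot (E[T_n])^{1/k} = \left( \frac{E[\bar{Y}]}{n} \sum_{i} i^{-\alpha} \right) \left( \sum_{i} i^{\alpha k} \right)^{1/k} \le \frac{n^{1/k}}{1 - \alpha}.$$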



Lemmas 5 and 6 together show that the expected mean gap of these threshold rules is only a small 
constant factor bigger than that of the optimal offline selection rule that picks the best n out of $E[T_n]$ samples. 
Finally we show that these threshold rules are in fact constant factor optimal in the following stricter sense: 
if an adversary were allowed to choose any n out of $T_{n+1} - 1$ samples, its expected mean gap is only a constant factor 
smaller than that of the online algorithm. We denote this optimal offline gap by $\mathrm{Gap}^*_{n+1}$. We consider 
$T_{n+1} - 1$ samples because the adversary should be able to use the samples we rejected while we were waiting 
for the (n + 1)th selection. 

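Lemma 7 For $c_i$ satisfying $c_i = 1/i^{\alpha}$ with $0 < \alpha < 1$ for all i, we have 

$$E[\mathrm{Gap}^*_{n+1}] \ge \frac{1}{16} \left( \frac{n}{E[T_n]} \right)^{1/k}.$$

Proof: By Markov's inequality we have 

$$E[\mathrm{Gap}^*_{n+1}] \ge \frac{1}{2} E\left[\mathrm{Gap}^*_{n+1} : T_{n+1} \le 2E[T_{n+1}]\right] \ge \frac{1}{2} \cdot \frac{1}{1 + 1/k} \left( \frac{n}{2E[T_{n+1}]} \right)^{1/k} \ge \frac{1}{16} \left( \frac{n}{E[T_n]} \right)^{1/k}.$$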

Together with Lemma 6 this proves Theorem 1 with the constant in the O(1) equal to $16/(1 - \alpha)$. 



4 Sample selection in binary trees 

We prove Theorem 2 in two parts: (1) the "fast-growing thresholds" case, that is, $c_i = O(\mathrm{poly}(i))$ and 
$c_i > \log i$ for all i, and, (2) the "slow-growing thresholds" case, that is, $c_i \le c_{i/2} + O(1)$ for all i. 

We begin with some notation and observations. Recall that for a node x in the tree $\mathcal{D}(x)$ denotes both 
the unique path from the root to x as well as the set of nodes covered by x (the shadow of x). Let $\mathcal{D}_k(x)$ be 
the kth node on $\mathcal{D}(x)$, $\mathcal{D}_{\le k}(x)$ be the first k nodes of $\mathcal{D}(x)$, and $\mathcal{D}_{>k}(x) = \mathcal{D}(x) \setminus \mathcal{D}_{\le k}(x)$. 

We say that a set of n paths associated with nodes $x_1, \ldots, x_n$ is independent at level k if $|\bigcup_{i=1}^{n} \mathcal{D}_k(x_i)| = n$. 
That is, no two paths share the same vertex at level k, and are disjoint after level k. We have the following 
fact. 



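Fact 8 If a set of n paths $\{\mathcal{D}(x_i)\}$, of length $\ge k'$ each, is independent at level $k < k'$, then 

$$|\mathcal{D}(\{x_1, \ldots, x_n\})| = \left| \bigcup_{i=1}^{n} \mathcal{D}(x_i) \right| \ge \sum_{i=1}^{n} |\mathcal{D}_{>k}(x_i)| \ge n(k' - k).$$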



Our analysis depends on whether $c_i$ is a slow-growing or fast-growing function. We first consider the 
case of $c_i = O(\mathrm{poly}(i))$ but with $c_i > \log_2 i$ for all i. 

Theorem 9 For the binary tree model described above, any threshold rule based on a sequence $\{c_i\}$, with 
$c_i = O(\mathrm{poly}(i))$ and $c_i > \log i$ for all i, achieves an O(1) competitive ratio in expectation with respect to 
coverage, as well as an O(1) approximation to the expected coverage. 

Proof: Let $f(i) = c_i - \log i$. We will first obtain an upper bound on $\mathrm{Cover}^*_n$. Let $S_n$ be the n selected nodes, 
$O_n$ the optimal set of n paths, and $R_n$ the paths that are rejected and are not covered by $S_n$. 

$$\mathrm{Cover}^*_n = |(\mathcal{D}(O_n) \cap \mathcal{D}(R_n)) \cup (\mathcal{D}(O_n) \cap \mathcal{D}(S_n))| \le |\mathcal{D}(O_n) \cap \mathcal{D}(R_n)| + |\mathcal{D}(S_n)| \le (2n + n f(n)) + \mathrm{Cover}_n.$$

Here the last inequality follows by noting that $\mathcal{D}(O_n) \cap \mathcal{D}(R_n)$ forms a binary tree with at most 2n vertices 
in the first log n levels, and at most n f(n) other vertices since it is the union of n paths of length at most 
log n + f(n). 

Next we obtain a lower bound on $\mathrm{Cover}_n$. Consider the last n/2 selected nodes $s_{n/2+1}, \ldots, s_n$ and 
their paths $\mathcal{D}(s_{n/2+1}), \ldots, \mathcal{D}(s_n)$. By definition, $|\mathcal{D}(s_{n/2+i})| > c_{n/2} = \log(n/2) + f(n/2)$. Let 
$N = |\bigcup_i \mathcal{D}_{\log(n/2)}(s_{n/2+i})|$ be the number of paths $\mathcal{D}(s_{n/2+i})$ that are independent at level log(n/2). Since 
$\mathcal{D}_{\log(n/2)}(s_{n/2+i})$ chooses from each of the n/2 nodes at level log(n/2) equiprobably, N has the same 
distribution as the number of occupied bins when n/2 balls are thrown into n/2 bins uniformly at random. The 
expected number of unoccupied bins is at most n/(2e). By Markov's inequality, with probability at least 1/2, the 
number of empty bins is at most n/e. So, we have that $\Pr[N \ge \frac{n}{2}(1 - 2/e)] \ge 1/2$. Thus, we have 
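The balls-and-bins step can be checked empirically. The sketch below throws m balls into m bins (m plays the role of n/2) and estimates how often the number of occupied bins reaches $m(1 - 2/e)$; the function names, the seed, and the trial count are arbitrary choices for illustration.

```python
import math
import random


def occupied_bins(m: int, rng: random.Random) -> int:
    """Throw m balls into m bins uniformly at random; count non-empty bins."""
    return len({rng.randrange(m) for _ in range(m)})


def fraction_meeting_bound(m: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which at least m * (1 - 2/e) bins are occupied."""
    rng = random.Random(seed)
    bound = m * (1 - 2 / math.e)
    hits = sum(occupied_bins(m, rng) >= bound for _ in range(trials))
    return hits / trials


print(fraction_meeting_bound(m=64, trials=200))
```

In practice the observed fraction is far above 1/2, since Markov's inequality is loose here; the proof only needs the constant-probability guarantee.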



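$$E\left[\frac{\mathrm{Cover}^*_n}{\mathrm{Cover}_n}\right] \le E\left[\frac{\mathrm{Cover}^*_n}{\mathrm{Cover}_n} : N \ge \frac{n}{2}(1 - 2/e)\right] + \Pr\left[N < \frac{n}{2}(1 - 2/e)\right]$$

$$\le E\left[\frac{n(2 + f(n)) + \mathrm{Cover}_n}{\mathrm{Cover}_n} : N \ge \frac{n}{2}(1 - 2/e)\right] + \frac{1}{2}$$

and by Fact 8 

$$\le \frac{3}{2} + \frac{2(2 + f(n))}{(1 - 2/e) f(n/2)} = O(1),$$

where the constant depends on f(n). Likewise we can obtain a bound on the approximation factor by noting 
that 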
$$E[\mathrm{Cover}_n] \ge E\left[\mathrm{Cover}_n : N \ge \frac{n}{2}(1 - 2/e)\right] \cdot \Pr\left[N \ge \frac{n}{2}(1 - 2/e)\right] \ge \frac{n}{4}(1 - 2/e) f(n/2)$$

and therefore, 

$$\frac{E[\mathrm{Cover}^*_n]}{E[\mathrm{Cover}_n]} \le 1 + \frac{2n + n f(n)}{\frac{n}{4}(1 - 2/e) f(n/2)} = O(1),$$

where once again the constant depends on f(n). ■ 

Next we consider the case when $c_i$ is a slow-growing function. In particular we assume that $c_i \le c_{i/2} + O(1)$ 
for all i. 

Theorem 10 For the binary tree model described above, any threshold rule based on a sequence $\{c_i\}$, with 
$c_i \le c_{i/2} + O(1)$ for all i, achieves an O(1) competitive ratio in expectation with respect to coverage, as 
well as an O(1) approximation to the expected coverage. 

Proof: We follow the outline of the previous proof. Consider the last n/2 selected nodes $s_{n/2+1}, \ldots, s_n$. If 
$c_{n/2} \ge \log(n/2)$, we consider the number of independent paths at level log(n/2) and the proof goes through 
exactly as before. So suppose that $c_{n/2} < \log(n/2)$. Let $N = |\bigcup_i \mathcal{D}_{c_{n/2}}(s_{n/2+i})|$. Then there are at most 
$2^{c_{n/2}} \le n/2$ nodes at level $c_{n/2}$, so we can use the same balls-and-bins argument as in the previous proof to 
obtain $\Pr[N \ge 2^{c_{n/2}}(1 - 2/e)] \ge 1/2$. 

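Since there are at most $2^{c_n + 1}$ nodes in the first $c_n$ levels of a binary tree, and $\mathrm{Cover}_n \ge N$, we have that 

$$E\left[\frac{\mathrm{Cover}^*_n}{\mathrm{Cover}_n}\right] \le E\left[\frac{2^{c_n + 1} + \mathrm{Cover}_n}{\mathrm{Cover}_n} : N \ge 2^{c_{n/2}}(1 - 2/e)\right] + \frac{1}{2} \le \frac{3}{2} + \frac{2^{c_n + 1}}{2^{c_{n/2}}(1 - 2/e)} = O(1).$$

We can also use the same argument as in the previous proof to prove the claimed bound on the approximation 
factor: 

$$\frac{E[\mathrm{Cover}^*_n]}{E[\mathrm{Cover}_n]} \le 1 + \frac{2^{c_n + 1}}{\frac{1}{2} \cdot 2^{c_{n/2}} \cdot (1 - 2/e)} = O(1).$$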



5 Sample selection in the skyline model 

In this section we focus on the following "skyline" model. We first consider the case where the universe 
X is the unit square $[0, 1]^2$ and $(x_1, y_1) \prec (x_2, y_2)$ for $(x_1, y_1), (x_2, y_2) \in X$ if and only if $x_1 \le x_2$ and 
$y_1 \le y_2$. In Section 5.2 we discuss general spaces. 

5.1 Uniform and product distributions over 2 dimensions 

As mentioned earlier, we consider threshold rules of the form $\mathcal{X}_i = \{(x, y) \in X : \mu(\mathcal{U}(x, y)) \le c_i\}$ for 
some sequence of numbers $\{c_i\}$, where $\mathcal{U}(x, y)$ is the set of points that dominate (x, y). 

For simplicity, we first prove the following version of Theorem 4 for the uniform distribution over X, 
and then describe how it extends to general product distributions. 

Theorem 11 For the skyline setting on the unit square with the uniform distribution, any threshold rule 
based on a sequence $\{c_i\}$ with $c_i = \Omega(1/\mathrm{poly}(i))$ achieves a 1 + o(1) competitive ratio in expectation with 
respect to the gap. 

Let $S_n$ denote the set of samples selected by the (implicit) threshold rule out of the set $T_n$ of samples 
seen. Let $R_n = T_n \setminus S_n$ denote the samples rejected by the threshold rule, and $O_n$ denote an optimal subset 
of $T_n$ of size n. Recall that our goal is to maximize the shadow of the selected subsample, and so all points 
in $O_n$ must be undominated by other points. Let $\mathcal{E}_n$ denote the event that $O_n$ contains a point in $R_n$, that is, 
there is a point in $R_n \setminus \mathcal{D}(S_n)$. It is immediate that $\mathrm{Gap}_n \ne \mathrm{Gap}^*_n$ if and only if the event $\mathcal{E}_n$ happens. We 
will show that the event $\mathcal{E}_n$ occurs with very low probability and use this fact to prove Theorem 11. 

We first show how the approximation factor and expected competitive ratio with respect to the gap of a 
threshold rule relate to the probability of the event $\mathcal{E}_n$. 

Lemma 12 For the skyline model with an arbitrary distribution, the gap of a threshold rule based on the 
sequence $\{c_i\}$ satisfies the following, where $\mathcal{E}_n$ is the event that $\mathrm{Gap}_n \ne \mathrm{Gap}^*_n$: 

$$E\left[\frac{\mathrm{Gap}_n}{\mathrm{Gap}^*_n}\right] \le 1 + \frac{1}{c_n} \Pr[\mathcal{E}_n].$$

Proof: We apply Bayes' rule to get 

$$E\left[\frac{\mathrm{Gap}_n}{\mathrm{Gap}^*_n}\right] \le 1 + E\left[\frac{\mathrm{Gap}_n}{\mathrm{Gap}^*_n} \,\Big|\, \mathcal{E}_n\right] \cdot \Pr[\mathcal{E}_n].$$

If the event $\mathcal{E}_n$ happens then by definition $O_n$ contains a point in $R_n$, say x. Then, $\mathrm{Gap}^*_n = \mathrm{Gap}(O_n) \ge 
\mu(\mathcal{U}(x)) \ge c_n$, where the last inequality follows from noting that $x \notin \mathcal{X}_n$. On the other hand, $\mathrm{Gap}_n$ is 
always less than 1. Therefore the claim follows. ■ 

To complete the proof of Theorem 11 we give an upper bound on the probability of the event $\mathcal{E}_n$. Our 
goal is to show that with high probability, every sample in R n is dominated by some sample in S n . We start 
with a simple observation about the number of rejected samples. 

Fact 13 $E[|R_n|] \le E[T_n] = \sum_{i=1}^{n} 1/\mu(\mathcal{X}_i) \le n/c_n$. 

Fact 14 Let $\mu$ be the uniform measure over $[0, 1]^2$. Then for all n, $\mu(\mathcal{X}_n) = c_n (1 + \ln(1/c_n))$. 
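The closed form in Fact 14 can be checked numerically. Under the uniform measure, a point (x, y) has upward closure of measure $(1 - x)(1 - y)$, so the threshold set with parameter c is $\{(x, y) : (1 - x)(1 - y) \le c\}$. The sketch below integrates its area with a midpoint rule; the function names and step count are illustrative choices.

```python
import math


def measure_of_threshold_set(c: float, steps: int = 200_000) -> float:
    """Area of {(x, y) in [0,1]^2 : (1-x)(1-y) <= c} by a midpoint rule.

    Substituting u = 1 - x and v = 1 - y, for each u the admissible v form
    an interval of length min(1, c / u).
    """
    h = 1.0 / steps
    return sum(min(1.0, c / ((j + 0.5) * h)) for j in range(steps)) * h


def closed_form(c: float) -> float:
    """c * (1 + ln(1/c)), the value stated in Fact 14."""
    return c * (1.0 + math.log(1.0 / c))


for c in (0.5, 0.1, 0.01):
    print(c, measure_of_threshold_set(c), closed_form(c))
```

The extra logarithmic factor is why the threshold set has measure somewhat larger than $c_n$, which in turn is what Lemma 15 exploits when comparing consecutive threshold sets.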
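We first note that many of the samples in $S_n$ are in fact in $\mathcal{X}_n$. 

Lemma 15 Let $\alpha$ be a constant satisfying $c_i / c_{i/2} \ge \alpha$ for large enough i. Then with probability $1 - o(c_n)$, 
at least $\alpha n/4$ of the samples in $S_n$ belong to $\mathcal{X}_n$. 

Proof: Consider samples belonging to $S_n \cap \mathcal{X}_{n/2}$; these are at least n/2 in number. We claim that a constant 
fraction of these are in $\mathcal{X}_n$ with high probability. In particular, 

$$\Pr[x \in \mathcal{X}_n : x \in \mathcal{X}_{n/2+i}] = \frac{\mu(\mathcal{X}_n)}{\mu(\mathcal{X}_{n/2+i})} \ge \frac{\mu(\mathcal{X}_n)}{\mu(\mathcal{X}_{n/2})} = \frac{c_n}{c_{n/2}} \cdot \frac{1 + \ln(1/c_n)}{1 + \ln(1/c_{n/2})} \ge \alpha.$$

Here we used the fact that $\frac{1 + \ln(1/c_n)}{1 + \ln(1/c_{n/2})} \ge 1$. Therefore, the expected number of samples in $S_n \cap \mathcal{X}_n$ is at least 
$\alpha n/2$. Since this is a binomial random variable, by Chernoff bounds, 

$$\Pr\left[|S_n \cap \mathcal{X}_n| \le \frac{1}{2} \alpha (n/2)\right] \le \exp\left\{ -\left(\frac{1}{2}\right)^2 \alpha (n/2) / 2 \right\} = o(c_n),$$

since $c_i = \Omega(1/\mathrm{poly}(i))$. ■ 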



The following lemma shows that given a sufficient number of samples in $S_n$ belonging to $\mathcal{X}_n$, with high 
probability $R_n$ is dominated by these samples. For the next lemma, let $\mathcal{E}'_n$ denote the event that at least one 
point in $R_n$ is not dominated by $S_n \cap \mathcal{X}_n$ and let $z = |S_n \cap \mathcal{X}_n|$. 

We first state the following consequence of measure continuity. 

Fact 16 Let $\mu$ satisfy measure continuity. Then for all $k \in \mathbb{N}$, and for all $y \notin \mathcal{X}_k$, we have $\mu(\mathcal{U}(y) \cap \mathcal{X}_k) \ge c_k$. 

Proof: By measure continuity, there exists a $z \in \mathcal{U}(y)$ such that $\mu(\mathcal{U}(z)) = c_k < \mu(\mathcal{U}(y))$. Thus, we have 
that $z \in \mathcal{X}_k$, $\mathcal{U}(z) \subseteq \mathcal{U}(y) \cap \mathcal{X}_k$, and so $\mu(\mathcal{U}(y) \cap \mathcal{X}_k) \ge \mu(\mathcal{U}(z)) = c_k$. ■ 
Lemma 17 Conditioned on $z$, we have

$$\Pr[\mathcal{E}'_n] \le \exp\left\{-\frac{z\,c_n}{\mu(X_n)}\right\} \cdot E[|R_n|] .$$



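Proof: For any sample $y$ in $R_n$, the probability that it is dominated by a uniformly random sample in $X_n$ is

$$\Pr[y \in V(x) : x \in X_n] = \frac{\mu(U(y) \cap X_n)}{\mu(X_n)} \ge \frac{c_n}{\mu(X_n)} ,$$

where the inequality is Fact 16. So, the probability that $y$ is not dominated by any point in $S_n \cap X_n$ is at most $\left(1 - \frac{c_n}{\mu(X_n)}\right)^z$. Since this bound holds regardless of the specific value of $y$, applying Wald's identity,

$$\Pr[\mathcal{E}'_n] \le E[\text{number of samples in } R_n \text{ not dominated by } S_n \cap X_n] \le \left(1 - \frac{c_n}{\mu(X_n)}\right)^z E[|R_n|] \le \exp\left\{-\frac{z\,c_n}{\mu(X_n)}\right\} \cdot E[|R_n|] .$$ ■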



Finally we are ready to prove Theorem 11.

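Proof of Theorem 11: Using Lemma 12 we have that

$$E\left[\frac{\mathrm{Gap}_n}{\mathrm{Gap}^*_n}\right] \le 1 + \frac{1}{c_n}\left(\Pr[\mathcal{E}_n : z \ge \alpha(n/4)] + \Pr[z < \alpha(n/4)]\right) ,$$

where $z = |S_n \cap X_n|$. By Lemma 15 the second term in the parentheses is $o(c_n)$. Event $\mathcal{E}_n$ implies event $\mathcal{E}'_n$. So applying Lemma 17 together with Facts 13 and 14 we get

$$\Pr[\mathcal{E}_n : z \ge \alpha(n/4)] \le \exp\left\{-\frac{z\,c_n}{\mu(X_n)}\right\} \cdot E[|R_n|] \le \exp\left\{-\frac{\alpha(n/4)}{1 + \ln 1/c_n}\right\} \cdot \frac{n}{c_n} = o(c_n) ,$$

since $c_i = \Omega(1/\mathrm{poly}(i))$. ■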



General product distributions. We now consider the skyline model with $\mu$ being an arbitrary product distribution. In particular, for a point $(a, b) \in X$, let $\mu(a, b) = \mu_x(a)\,\mu_y(b)$ for one-dimensional measures $\mu_x$ and $\mu_y$. Our proof of Theorem 4 is nearly identical to our argument for the uniform case. We note first that as before

$$E\left[\frac{\mathrm{Gap}_n}{\mathrm{Gap}^*_n}\right] \le 1 + \frac{1}{c_n}\Pr[\mathcal{E}_n] .$$

To bound the probability of $\mathcal{E}_n$, we give a reduction from the product measure setting to the uniform measure setting. In particular, consider mapping $X$ into $X' = [0, 1] \times [0, 1]$ by mapping a point $(a, b) \in X$ to $(\mu_x(a), \mu_y(b)) \in X'$. Then it is easy to see that $X_i$ in $X$ gets mapped to $X'_i$ in $X'$ for the same sequence $\{c_i\}$. Then, the probability of the event $\mathcal{E}_n$ under the transformation remains the same as before, and is once again $o(c_n)$.
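The reduction can be illustrated with a short sketch. It assumes that $\mu_x(a)$ and $\mu_y(b)$ are read as cumulative distribution values, so that each coordinate is pushed through its own CDF (the probability integral transform); the exponential marginals and all function names below are hypothetical choices for the example, not taken from the paper.

```python
import math
from typing import Callable, Tuple

Point = Tuple[float, float]


def exponential_cdf(rate: float) -> Callable[[float], float]:
    """CDF of an exponential distribution, used here only as an example marginal."""
    def cdf(t: float) -> float:
        return 1.0 - math.exp(-rate * t) if t > 0 else 0.0
    return cdf


def to_unit_square(point: Point,
                   cdf_x: Callable[[float], float],
                   cdf_y: Callable[[float], float]) -> Point:
    """Map (a, b) to (cdf_x(a), cdf_y(b)), a point of [0, 1] x [0, 1]."""
    a, b = point
    return (cdf_x(a), cdf_y(b))


def dominates(p: Point, q: Point) -> bool:
    """True if p is at least q in both coordinates."""
    return p[0] >= q[0] and p[1] >= q[1]


if __name__ == "__main__":
    cdf_x, cdf_y = exponential_cdf(1.0), exponential_cdf(0.5)
    p, q = (2.0, 1.0), (1.0, 0.5)
    print(dominates(p, q),
          dominates(to_unit_square(p, cdf_x, cdf_y), to_unit_square(q, cdf_x, cdf_y)))
```

Because each coordinate map is monotone, dominance between points is unchanged by the mapping, which is the property the reduction relies on.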
5.2 General spaces



In this section, we show how our results from the skyline model generalize and prove Theorem 3.

Once again we note that Lemmas 12 and 17 carry over to this setting. So our main approach is to bound the probability of the event $\mathcal{E}'_n$. The main difference from the previous analysis is that the best bound on the size of $X_i$ we can obtain is $c_i \le \mu(X_i) \le 1$. This means that we can no longer claim that a constant fraction of the samples in $S_n$ belong to $X_n$. Instead we will show that under a stronger condition on the sequence $\{c_i\}$, namely $c_i = \Omega(1/i^{\epsilon})$, the number of samples in $X_n$ is $\Omega(n c_n)$ with high probability. This will suffice to give us the bound we need. In particular, we have the following weaker version of Lemma 15.

Lemma 18 For $c_i = \Omega(1/i^{\epsilon})$ with $\epsilon < 1$, we have that

$$\Pr[|S_n \cap X_n| < n c_n/2] = o(c_n) .$$

Proof: For all $i \le n$, we have that

$$\Pr[x \in X_n : x \in X_i] \ge \mu(X_n) \ge c_n .$$

Thus, by Chernoff bounds,

$$\Pr\left[|S_n \cap X_n| < \frac{n c_n}{2}\right] < \exp\{-\Omega(n^{1-\epsilon})\} = o(c_n) .$$

■

Finally, we use the previous approach to prove Theorem 3.

Proof of Theorem 3: By Lemma 12 we have

$$E\left[\frac{\mathrm{Gap}_n}{\mathrm{Gap}^*_n}\right] \le 1 + \frac{1}{c_n}\left(\Pr[\mathcal{E}_n : z \ge n c_n/2] + \Pr[z < n c_n/2]\right) ,$$

where $z = |S_n \cap X_n|$. Using Fact 13 and Lemma 17 we have that

$$\Pr[\mathcal{E}_n : z \ge n c_n/2] \le \Pr[\mathcal{E}'_n : z \ge n c_n/2] \le \exp\left\{-\frac{z\,c_n}{\mu(X_n)}\right\} \cdot E[|R_n|] \le \exp\left\{-n \cdot n^{-2(1/2 - \Omega(1))}\right\} \cdot n/c_n = o(c_n) ,$$

where the last step follows by $c_i = i^{-(1/2 - \Omega(1))}$. Together with Lemma 18 this proves Theorem 3. ■
A Missing proofs

We present the technical proofs we skipped over in the main article. 






A.1 Sample selection in one dimension

In Section 3 we had some technical derivations that we prove in detail here.

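Proof of the lemma from Section 3: Let $G(x) = x^k$ be the distribution of the gap of any sample. The expectation of the $m$th order statistic for $t_n$ samples from this distribution is as follows. See Claim 1 below for a proof.

$$E[X_{(m)}] = \frac{\Gamma(t_n + 1)\,\Gamma(m + 1/k)}{\Gamma(t_n + 1 + 1/k)\,\Gamma(m)} .$$

So, we have that the expected mean gap of the $n$ smallest-gap subsamples out of $t_n$ samples is

$$E\left[\frac{1}{n}\sum_{i=1}^{n} X_{(i)}\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_{(i)}] = \frac{1}{n} \cdot \frac{\Gamma(t_n + 1)}{\Gamma(t_n + 1 + 1/k)} \sum_{m=1}^{n} \frac{\Gamma(m + 1/k)}{\Gamma(m)} ;$$

we postpone the proof of the following step to Claim 2 below,

$$= \frac{1}{n} \cdot \frac{\Gamma(t_n + 1)}{\Gamma(t_n + 1 + 1/k)} \cdot \frac{n\,\Gamma(n + 1 + 1/k)}{(1 + 1/k)\,\Gamma(n + 1)} = (1 + o(1))\,\frac{1}{1 + 1/k}\left(\frac{n}{t_n}\right)^{1/k} ,$$

where the last line follows by equation 6.1.46 of [1].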
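To see how close the gamma-function expression is to its asymptotic form, the following sketch evaluates $\frac{\Gamma(t_n+1)\,\Gamma(n+1+1/k)}{\Gamma(t_n+1+1/k)\,(1+1/k)\,\Gamma(n+1)}$ through log-gamma values and compares it with $\frac{1}{1+1/k}(n/t_n)^{1/k}$. The function names and the sample values of $n$, $t_n$, and $k$ are illustrative assumptions.

```python
import math


def exact_mean_gap(n: int, t: int, k: float) -> float:
    """Gamma-function expression for the expected mean gap of the n smallest of t samples.

    Computed in log space so that large arguments do not overflow.
    """
    log_value = (math.lgamma(t + 1) - math.lgamma(t + 1 + 1.0 / k)
                 + math.lgamma(n + 1 + 1.0 / k) - math.lgamma(n + 1)
                 - math.log(1.0 + 1.0 / k))
    return math.exp(log_value)


def asymptotic_mean_gap(n: int, t: int, k: float) -> float:
    """Leading-order form (n / t)^(1/k) / (1 + 1/k)."""
    return (n / t) ** (1.0 / k) / (1.0 + 1.0 / k)


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, exact_mean_gap(1000, 10_000, k), asymptotic_mean_gap(1000, 10_000, k))
```

For $k = 1$ the exact expression reduces to $(n+1)/(2(t_n+1))$, which gives a convenient hand check of the code.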

We now prove the two claims missing in the above proof. 
Claim 1 The expectation of the $m$th order statistic with $F(x) = x^k$ is

$$E[X_{(m)}] = \frac{\Gamma(t_n + 1)\,\Gamma(m + 1/k)}{\Gamma(t_n + 1 + 1/k)\,\Gamma(m)} .$$



Proof: From page 236 of [14], we have that the $m$th order statistic of a sample of size $t_n$ from a population having continuous distribution function $F(x)$ and probability density function $f(x)$ has the probability density function:

$$\frac{\Gamma(t_n + 1)}{\Gamma(m)\,\Gamma(t_n - m + 1)}\,[F(x_{(m)})]^{m-1}\,[1 - F(x_{(m)})]^{t_n - m}\, f(x_{(m)})\,dx_{(m)} .$$

So, we have that the expectation of the $m$th order statistic with $F(x) = x^k$ is

$$E[X_{(m)}] = \int_0^1 x_{(m)} \cdot \frac{\Gamma(t_n + 1)}{\Gamma(m)\,\Gamma(t_n - m + 1)} \cdot [F(x_{(m)})]^{m-1} \cdot [1 - F(x_{(m)})]^{t_n - m} \cdot f(x_{(m)})\,dx_{(m)}$$

$$= \frac{\Gamma(t_n + 1)}{\Gamma(m)\,\Gamma(t_n - m + 1)} \int_0^1 x\,[x^k]^{m-1}\,[1 - x^k]^{t_n - m}\,k x^{k-1}\,dx ;$$

using the substitution $y = x^k$, we have

$$= \frac{\Gamma(t_n + 1)}{\Gamma(m)\,\Gamma(t_n - m + 1)} \int_0^1 y^{m - 1 + 1/k}\,(1 - y)^{t_n - m}\,dy$$

$$= \frac{\Gamma(t_n + 1)}{\Gamma(m)\,\Gamma(t_n - m + 1)}\,B(m + 1/k,\ t_n - m + 1) ,$$

where $B(\cdot,\cdot)$ is the Beta function,

$$= \frac{\Gamma(t_n + 1)}{\Gamma(m)\,\Gamma(t_n - m + 1)} \cdot \frac{\Gamma(m + 1/k)\,\Gamma(t_n - m + 1)}{\Gamma(t_n + 1 + 1/k)}$$

$$= \frac{\Gamma(t_n + 1)\,\Gamma(m + 1/k)}{\Gamma(t_n + 1 + 1/k)\,\Gamma(m)} .$$ ■
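Claim 1 can be sanity-checked by integrating the order-statistic density numerically and comparing with the closed form. The sketch below assumes $F(x) = x^k$ on $[0, 1]$ with density $f(x) = k x^{k-1}$; the function names, the midpoint rule, and the grid size are illustrative choices.

```python
import math


def closed_form_mean(m: int, t: int, k: float) -> float:
    """Gamma(t+1) Gamma(m+1/k) / (Gamma(t+1+1/k) Gamma(m)), computed via log-gamma."""
    return math.exp(math.lgamma(t + 1) + math.lgamma(m + 1.0 / k)
                    - math.lgamma(t + 1 + 1.0 / k) - math.lgamma(m))


def numeric_mean(m: int, t: int, k: float, grid: int = 20_000) -> float:
    """Midpoint-rule estimate of E[X_(m)] for t samples with F(x) = x^k on [0, 1]."""
    # Normalizing constant Gamma(t+1) / (Gamma(m) Gamma(t-m+1)) of the density.
    norm = math.exp(math.lgamma(t + 1) - math.lgamma(m) - math.lgamma(t - m + 1))
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        x = (i + 0.5) * h
        cdf = x ** k
        pdf = k * x ** (k - 1)
        density = norm * cdf ** (m - 1) * (1.0 - cdf) ** (t - m) * pdf
        total += x * density * h
    return total


if __name__ == "__main__":
    print(closed_form_mean(3, 10, 2), numeric_mean(3, 10, 2))
```

With $k = 1$ the samples are uniform and the closed form reduces to the familiar $m/(t_n + 1)$.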



Claim 2

$$\sum_{m=1}^{n} \frac{\Gamma(m + 1/k)}{\Gamma(m)} = \frac{n\,\Gamma(n + 1 + 1/k)}{(1/k + 1)\,\Gamma(n + 1)} .$$

Proof: To simplify a sum involving gamma functions, we can use the idea that

$$\Gamma(s) = \int_0^\infty t^{s-1} e^{-t}\,dt .$$

Now, we apply the Laplace transform. Using $s = m + a$ with $a = 1/k$, and then interchanging summation and integration, we get

$$\sum_{m=1}^{n} \Gamma(m + a)/\Gamma(m) = \int_0^\infty \sum_{m=0}^{n-1} \frac{t^{m+a} e^{-t}}{m!}\,dt$$

$$= \int_0^\infty t^a \sum_{m=0}^{n-1} e^{-t}\,\frac{t^m}{m!}\,dt$$

$$= \int_0^\infty t^a \Pr[\mathrm{Poisson}(t) < n]\,dt$$

$$= \int_0^\infty t^a \Pr[G(n) > t]\,dt ,$$

where $G(n)$ is a Gamma($n$) random variable. The last identity follows by the Poisson process. Plugging in

$$\Pr[G(n) > t] = \frac{1}{\Gamma(n)} \int_t^\infty u^{n-1} e^{-u}\,du$$

and then interchanging the order of integration gives an integral that evaluates to

$$\frac{n\,\Gamma(n + 1 + 1/k)}{(1/k + 1)\,\Gamma(n + 1)} .$$ ■
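The identity in Claim 2 is easy to confirm numerically for small cases. The sketch below sums the gamma ratios directly and compares the result with the closed form; the function names and the tested values of $n$ and $k$ are illustrative assumptions.

```python
import math


def gamma_ratio_sum(n: int, k: float) -> float:
    """Left-hand side: sum over m = 1..n of Gamma(m + 1/k) / Gamma(m)."""
    return sum(math.exp(math.lgamma(m + 1.0 / k) - math.lgamma(m))
               for m in range(1, n + 1))


def claimed_closed_form(n: int, k: float) -> float:
    """Right-hand side: n Gamma(n + 1 + 1/k) / ((1/k + 1) Gamma(n + 1))."""
    ratio = math.exp(math.lgamma(n + 1 + 1.0 / k) - math.lgamma(n + 1))
    return n * ratio / (1.0 / k + 1.0)


if __name__ == "__main__":
    for n in (1, 5, 50):
        for k in (1, 2, 3):
            print(n, k, gamma_ratio_sum(n, k), claimed_closed_form(n, k))
```

For $k = 1$ both sides equal $n(n+1)/2$, the sum of the first $n$ integers.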